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Abstract 

This paper considers the use of massively parallel architectures to execute discrete-event simulations 
of what we term “self-initiating” models. A logical process in a self-initiating model schedules its own 
state re-evaluation times, independently of any other logical process, and sends its new state to other 
logical processes following the re-evaluation. Our interest is in the effects of that communication on 
synchronization. We consider the performance of various synchronization protocols by deriving upper 
and lower bounds on optimal performance, upper bounds on Time Warp’s performance, and lower bounds 
on the performance of a new conservative protocol. Our analysis of Time Warp includes the overhead costs 
of state-saving and rollback. The analysis points out sufficient conditions for the conservative protocol to 
outperform Time Warp. The analysis also quantifies the sensitivity of performance to message fan-out, 
lookahead ability, and the probability distributions underlying the simulation. 


*A preliminary version of this paper appears in the Proceedings of the 1990 SIGPLAN PPoPP Symposium. 
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1 Introduction 

The problem of parallelizing discrete-event simulations has recently received a great deal of attention. Parallel 
simulations are typically described as a collection of Logical Processes, or LPs. Each LP maintains its own 
simulation clock, and communicates with other LPs using time-stamped messages. We assume each LP 
executes on its own processor, as it might on a massively parallel architecture. The state of an LP at 
simulation time t depends on the contents of all messages that should be sent to it with time-stamps less 
than t. There are two primary ways in which an LP re-evaluates its state. One way is epitomized by a 
queueing network simulation: a job leaving one queue ( LP ) causes the receiving queue to re-evaluate its 
state. This is an example of a message-initiating model, because state re-evaluations at an LP are caused by 
messages sent from other LPs. A different method occurs when an LP alone determines when to re-evaluate 
its state. The LP will send messages to other LPs following a re-evaluation, because those IP’s eventual 
re-evaluations will require that information. However, the state messages do not cause the recipients to re- 
evaluate their state. The messages cause events that serve only to store the transmitted state information. 
This paper concerns such models; we will call them self-initiating models. As we later discuss, this class of 
models includes problems as diverse as the Ising spin simulation[14], and trace-driven multiprocessor cache 
simulations. 

Synchronization has been a major concern of research in parallel simulation. One way of ensuring 
correctness is to block an LP from computing its state at t if there is any chance that it will later receive 
a message with time-stamp s < t. This type of blocking is an open invitation to deadlock; irregular 
and unpredictable synchronization requirements make parallelizing discrete-event simulations a non-trivial 
problem. Early research efforts focussed on developing deadlock-free synchronization protocols. Two schools 
of thought emerged. The conservative school studied protocols that maintain consistency in the simulation 
state: an LP is never allowed to advance its clock so far that it can receive a message in its past. LPs 
exploit specific information about the simulation model to avoid or break deadlock. The optimistic school 
proposed Time Warp, a scheme that permits an LP to advance its clock without blocking. When an LP 
does receive a message in its past, it “rolls back” its clock to the point of the temporal fault, and restores its 
state to one existing prior to the fault. Time Warp does not need to use specific model information. Indeed, 
a major attraction of Time Warp is its transparency to the simulation modeler. 

Before parallel machines were commonly available, the debate between conservative and optimistic camps 
was largely philosophical. Then, as performance studies were published, no clear consistently best choice 
emerged. The earliest conservative protocols of Chandy and Misra were shown to suffer from serious per- 
formance problems on some queueing network simulations [22], but have recently been shown to work well 
on road network simulations [17]. Other conservative protocols, notably [13] and [21], achieved acceptable 
performance on some problems by exploiting information about the simulation model. Time Warp too was 
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shown to achieve acceptable performance on some problems [3, 4]. The overhead costs of state-saving and 
rollback continue to be a major drawback to all optimistic schemes; hardware accelerators for these functions 
have been proposed [1, 5]. 

Throughout this debate, little analytic theory was developed to predict, explain, or bound the perfor- 
mance of parallel simulations. Exceptions are the detailed analyses developed in [9] and [18]. However, these 
studies are limited to two processors, and have not been extended. Theory for massively parallel simula- 
tions is now starting to appear. Wagner and Lazowska derive an upper bound on the speedups possible 
in a queueing network simulation [25]. Studies of Time Warp tend to assume negligible state-saving and 
rollback costs. Lin and Lazowska have shown that if Time Warp has no state-saving or rollback costs, and 
if “correct” computations are never rolled back, then Time Warp achieves optimality [11]. This is intuitive, 
because Time Warp aggressively searches for the simulation’s critical path— if it is able to do so without 
cost, its performance must be optimal. Other analyses highlight the fact that Time Warp can “guess right” 
while conservative methods must block. Lip ton and Mizell have shown that there is a certain asymmetry 
between optimistic and conservative methods: while it is possible for an optimistic method to arbitrarily 
outperform a conservative method, the converse is not true [12]. Madisetti, Walrand, and Messerschmitt [16] 
have developed a performance model that aspires to estimate the rate at which simulation time advances 
under an optimistic strategy such as Time Warp. They model the behavior of the system as a Markov chain, 
and include the cost of communication and of synchronization. Their analysis is exact for two processors, 
and approximate for a general number of processors. Their analysis is interesting in that it permits a study of 
different re-synchronization schemes. However, it does not address issues we attack directly, namely, bounds 
on optimal performance and sensitivity to message-fanout and lookahead ability. 

Analytic studies of conservative protocols [15, 20] are of synchronous protocols— a significant departure 
from the field s roots in distributed systems. These studies have established the important property that 
performance of the studied methods scales up with increasing problem size and architecture. Furthermore, 
the analysis in [20] demonstrates that as the problem size increases relative to the architecture, performance 
under the method converges to optimality. The rate of convergence depends very much on the nature of the 
stochastic processes driving the simulation. 

A number of issues have not yet been directly addressed analytically, and are the focus of this paper. 
Specifically, we place non-trivial upper bounds on optimal performance; we include the overhead costs of 
Time Warp in a model that bounds its performance from above; we study a new conservative protocol and 
place a lower bound on its performance; we give conditions under which the conservative protocol achieves 
better performance than Time Warp. In the course of these derivations we quantify (approximately) the 
sensitivity of performance to lookahead ability, message fan-out, and the probability distributions driving 
the simulation. All of these fore-mentioned factors are shown to have significant influence on performance; 
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performance improves as lookahead ability improves or as the variability of the probability distribution 
decreases, performance degrades as the message fanout increases. It is important to note that these conclu- 
sions are derived in the context of self-initiating simulation models only. Different but related results can be 
derived in the context of message-initiating models 1 . 

2 Model 

We model a parallel simulation as a collection of N logical processors ( LPs ) named LP\,.. ■ ,LP N . Each 
logical processor has its own simulation clock. LPs communicate through the exchange of time-stamped 
messages. Viewed from the perspective of simulation time, an LP advances forward by executing some 
activity that we will call a cycle. In the self-initiating models we consider, at the end of one cycle the LP 
schedules the end of the next cycle, independently of any messages it may have received from other LPs. 
We let Ci(j) denote the value of LPi s clock at the end of the jth cycle. The length of simulation time that 
LPi advances by executing its jth cycle is a random number X,j from a distribution T . Consequently, for 
every LPi and cycle j 

c,(j) = ± x >, 

k= 1 

Assuming that the time increment variables are all independent, C,(j) can be interpreted as the time of the 
jth renewal in some renewal process [24] with inter-renewal distribution P. We introduce communication 
to the model by assuming that each LPi associates a set of I< messages with the completion of each of its 
cycles. K is called the message fanout . Typically, these K messages are intended to inform "nearby” LP' s 
of the new state just computed. The arrival of such a message at an LP may cause an event, but one that 
serves only to store the transmitted value. K = 2 might be appropriate in a ID domain, A = 4 or A — 8 in 
a 2D domain, A = 6 or I< = 26 would be appropriate in a 3D domain. It is important to note that under 
our formulation these message fanouts are part of the simulation model, and hence are independent of the 
synchronization protocol used. A message associated with the completion of LP/s jth cycle has time-stamp 

CiUY 

The simulation is modeled as N statistically independent, concurrent renewal processes that communi- 
cate. Certain points in the analysis to follow are made possible by the assumption of statistical independence 
between the recipients of a common message. To support this need for independence we assume that the A 
recipients of a message are chosen uniformly at random from the set of all LPs, and that each LP indepen- 
dently chooses a new set of recipients each cycle. This assumption does not accurately model the behavior 
of any common simulation model, and is used purely to promote tractability. We have performed simula- 

1 Performance Bounds on Parallel Message-Initiating Discrete-Event Simulations , D. Nicol, in preparation 
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tion studies of our analytic model using “nearest-neighbor” communication, and have found that processor 
utilizations are only slightly higher than those achieved using the randomized communication patterns our 
analysis assumes. We may have some confidence therefore that the conclusions derived under the assumption 
of randomized communication are not completely off the mark. 

We will consider two different forms for the probability distribution T. In one form T has a continuous 
cumulative probability distribution function, implying that its associated renewal process is non-lattice[24] 2 . 
This form excludes simulation models where time-increments move forward more discretely. We therefore 
also derive results under the assumption that T is a geometrically distributed random variable, with mean 
1/p, where 0 < p < 1. 

Depending on the simulation model, it may be possible to send the messages associated with the comple- 
tion of a cycle before an LP actually executes that cycle. In some cases the content of the messages cannot be 
predicted, but the time of the messages can. In the former case we will say the simulation has full-lookahead , 
in the latter case we say it has time-lookahead . An example of a model with time-lookahead is the Ising spin 
simulation [14]. LPs model individual particles, each of which is “jiggled” by thermal effects, at random 
intervals. When a particle is jiggled its new magnetic spin is computed as a function of the spins of nearby 
atoms at that simulation time. The length of simulation time between jigglings defines a cycle. We are able 
to predict when next a particle will be jiggled — this time comes from a random number generator — but will 
not know the spins of nearby particles at that simulation time until the simulation actually advances that 
far. 

An example of a model with full-lookahead (although it’s a message initiating model) is a queueing 
network with a non-preemptive and load-independent queueing discipline. At the time a job enters service, 
say s, we can predict the time at which it will leave service, say t. In fact, we can notify the recipient queue 
of that job’s arrival at time t . This is not to say that we can actually simulate up to time t . For example, 
one of the statistics we may be interested in is the average length of the queue at the time a job departs. 
To measure the queue length at t we need to receive any additional jobs that may arrive between times s 
and t. The lookahead ability derives from the fact that arrivals between s and t in no way affect the output 
behavior of the LP at time t. 

Another example of a model with full-lookahead is a simple trace-driven multiprocessor cache simulation 
that estimates hit statistics, such as that described in [10]. An LP models one processor’s cache; cycles 
are composed of the processing of a contiguous sequence of purely local memory references terminated by a 
reference to global memory, the “time” of any reference is the number of trace references preceding it 3 . An 

2 A non-negative random variable X is non-lattice if there does not exist any real number d such that Pr{A' = nd) = 1. 

note that this property would not be satisfied by a simulation that more accurately models the advancement of time, e.g., 

one that accounts more time for a miss than a hit. A weaker form of lookahead exists where the LP can put a lower boimd on 
the time of its next global reference by assuming all local references wiU be hits. 
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LP sends messages to all other LPs whenever it makes a reference to global memory. By looking at its own 
trace the LP can predict the time and content of its future messages. Throughout this paper, protocols that 
exploit full-lookahead will do so by requiring an LP to send a message as soon as it is able to predict that 
message. We will still require that an LP not process a cycle with completion time t until all messages with 
time-stamps less than t have been received, because certain calculations internal to the LP (e.g. statistics 
gathering) may require that this monotonicity be preserved. Protocols that exploit time-lookahead do so by 
requiring an LP to send an appointment message containing the time of a future message as soon as it is 
able to predict that a message will later be sent with that time-stamp. Our analysis will be of simulations 
with full-lookahead. We will later remark on how that analysis can be extended to simulations with only 
time- lookahead. 

We assume that processing a cycle requires one tick of real time. This permits us to view the progress of 
the simulation synchronously. While an LP will read all messages sent to it, at each tick, it need not process 
a cycle every tick; in fact, synchronization constraints may prevent it from doing so. 

Some synchronization protocol must be used to ensure correctness. A conservative protocol prevents an 
LP from advancing so far that it can receive a message with a time-stamp smaller than its clock value. For 
example, imagine a situation where LPi'. s clock is s and it will increment its clock to value t on the next 
cycle it processes. Imagine that LP k will send a message to LPi with time-stamp v, s < v < t, at the end 
of the k + 4th tick. A conservative protocol will ensure that LPi is idle during ticks k + 1 through k + 4. 
An optimistic protocol may permit LPi to advance its clock during these ticks, but will then recognize a 
temporal error upon receipt of the message with time-stamp v y and roll back. A rollback at LP, can itself 
cause other rollbacks on other LPs, as false messages sent by that LP are undone. 

Our goal is not to propose a model that precisely describes all self-initiating parallel simulations, nor 
is it to analyze the most general possible class of simulation models. Self-initiating models are by no 
means the most common kind of simulations, and many simulations will not have the power of lookahead 
that we analyze. However, the analytic modeling of parallel simulations is an art in its infancy. We are 
simply trying to shed some light on a tractable style of analysis that produces reasonable (and intuitive) 
results. Even so, despite the many preceding qualifications the proposed model bears a close resemblance to 
simulations of practical interest. In particular the model accurately describes the behavior of the Ising spin 
and multiprocessor cache simulations described earlier. 


3 Optimal Performance 

Finding non-trivial upper and lower bounds on the performance one can achieve in a parallel simulation is 
an open question. We derive an upper bound on the performance any protocol can achieve under our model 
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assumptions, and derive lower bounds on the performance of a new synchronous conservative protocol. Our 
bound on the performance of optimistic protocols is independent of the message-cancellation strategy used. 

3.1 Upper Bounds 

We will present an analytic approach that provides upper bounds on optimal performance for a whole family 
of lookahead capabilities. Consider any cycle on any LP, and assume that an oracle schedules the processing 
of that cycle on the earliest possible tick such that no further messages will be received by that LP with 
a time-stamp less than the time at the end of the cycle. Under our assumptions, this scheduling policy is 
obviously optimal. Our upper bounds assume the use of this oracle. 

Different simulation models have different lookahead abilities. Some models have no lookahead, others 
are able to predict ahead one cycle, some may be able to predict multiple cycles into the future. For example, 
the ability of the multiprocessor cache simulation to predict its own future references to global memory is 
limited only by the memory required to store its trace. We will categorize these abilities by the number of 
cycles that can be predicted. A simulation model will be said to have /-cycle full-lookahead if the time and 
content of output messages associated with the completion of cycle k can be predicted at the completion of 
cycle (k — J). We assume that simulations exploiting J-cycle full- lookahead will always “pre-send” a message 
J cycles before the message’s associated cycle. 

The basic approach to constructing an upper bound is simple. The Global Virtual Time [7] at tick i 
is denoted GVT(i)- this quantity is the least clock value among all LPs at tick i, and is typically used 
to gauge the progress of the simulation. We desire to bound the limiting rate of simulation time increase, 
hm.^oo GVT(i)/i. Our approach is to bound GVT(i) by a function N(i), which is the minimum time-stamp 
among the “next” messages sent by LPs who received the minimum-time stamped message at tick i - 1. 

We then appeal to asymptotic arguments to estimate lim,^ N(i)/i, and hence bound the limiting rate of 
simulation time increase. 

The bound is constructed as follows. Let be the least time-stamp among all messages sent at tick 

i. This value need not be equal to GVT(i ), because the LP with least clock may not have sent a message, 
being prevented from doing so by the knowledge that an impending message will arrive with time-stamp 
less than its next cycle time. Let r(i,j) be the index of the jth LP (among K) who receives the message 
with time-stamp at tick i. Consider any LP r(iJ) , and suppose time ) falls within the simulation 

time span encompassed by its njth cycle. The next message LP r(i]) sends cannot have a time-stamp larger 
than C r(u) (n, + J), the time associated with the end of its (nj + J)th cycle. The gap of simulation time 
between t mtn (i) and C r{iJ) (nj+J) is composed of the sum of a number of random variables: a cycle residual 
- t m jn(i), plus J cycle time random variables ( a J-fold convolution of/ 7 ). This is illustrated by 

Figure 1 
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Figure 1: Cycle residual and lookahead cycles for LP r (i,j) > n model with J — 4 cycle full-lookahead 

+ 1) cannot be larger than the least time-stamp on any message sent at the end of tick i+ 1 by 
one of LF^ i), . . . , LP r (i,K)- Call this latter time-stamp N(i + 1). Observe that we may write 

N(i + 1) = + M K:J {i + 1), 

where 

"h 1) = nlin { C r (i t j)( n j ) — ^min(0 + Fj (*0}i 
1 ^ K 

Tj(J) being a J-fold convolution of T random variables. Finally, observe that N(i + 1) > GVT(i+ 1). Since 
our object is to bound lim,'_ 0O GVT(i)/i , it will suffice to bound lim,-_*oo N(i)/i. 

Observe that for all i > 0, 

i 

N(*)/i = 

j = l 

= ^2 ~ *) + ~ 1 )) /* 

>=1 

i 

< Y1 TO - 1) + - N(j - 1)) /» 

J = 1 

j= i 

The appendix gives heuristic reasons for expecting that if ^’s tails aren’t too large, (e.g., if T is NBUE, or has 
an increasing hazard rate function), then it is reasonable to assume that the sequence {M K ,j(i)} converges 
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into a wide-sense stationary process having finite correlation time [8]. While the supposition is technical, for 
our purposes here it implies that the limiting value of sum (1) converges to \P(A', J) = lim^oo E[Mx j(i)]. 
J) then bounds the limiting rate of increase in GVT. 

The preceding discussion leads to our first proposition. 

Proposition 1 For every tick i let < m i n (0 be the least time-stamp among all messages sent at the end of 
tick i, and let LP r (i i), . . . , LP r ^ x) be the set of LP who receive the t m in (i)-time message . Let nj be the 
cycle index of the LP r ^^ cycle containing time t min (i), and Pj(J) be a convolution of J random variables 
having distribution T . Define 

Mkj{} + 1) = ) “ ^min(0 + P j (^)} 5 

and let = lim^oo If the sequence . . . converges to a wide-sense 

stationary process with finite correlation time , then 

lim GVT(i)/i < 

i — ♦ OO 


A second proposition follows from the observation that a simulation advancing time at rate qfi hats an average 
processor utilization of q%. 

Proposition 2 Let the conditions of Proposition 1 be satisfied f and let fi be the mean of P , Then the average 
processor utilization is no greater than J)/ fi. 


We must estimate ^(/v, J) before these propositions yield any insight on performance. Reconsider the 
definition of Mk,j(i)> One takes the minimum of I\ random variables; each random variable includes the 
excess time after of the cycle containing t min (i). We call this time difference a residual. A similar 

concept is studied in renewal theory, the residual life of a renewal processes. The difference between our 
residuals and those of renewal theory is that our t min (i) is itself variable, whereas renewal theory considers the 
residual following a constant time t. However, in the Appendix we show that if T is non-lattice and t m in (i) 
is independent of LP r ^ ^, then the limiting residual life has the same distribution as that derived in renewal 
theory. This limiting distribution is the equilibrium distribution[ 24] of T, called P e . It is not completely 
unreasonable to to assume for the purposes of approximation that t min (i) is independent of n 11 the LP r ^ 
due to the fact that the set of recipients of the t mm (i)-time message were chosen uniformly at random from 


8 



the entire collection of LPs . To the extent that this is a reasonable approximation, as k grows, 
increasingly becomes the minimum of K independent and identically distributed (iid) random variables, each 
the sum of an T e random variable and an independent J-fold convolution of T. Alternately, if T is geometric 
then its residual life has the same geometric distribution, due to the memoryless property. 

Our assumption of random communication patterns is now needed again. When the number of LPs is 
large compared to A, and when the partners of each communication are chosen randomly, we may take 
the A random variables comprising Mr , j{k) as being independent. This is not rigorously true, as there is 
a very slight dependence of the residuals on the fact that their LP s did not send the f min -time message, 
since this is true of all but one LP in the entire system, the assumption of independence is reasonable. Our 
approximation of J) is denoted A(A, J), and is given by 

^(A, J) to E[mm{I< iid T t + T(J) random variables}] (2) 

= A(/\ , J). 

Throughout this paper we will implicitly assume the validity of this approximation as a hypothesis to 
each proposition. The form of the approximation is especially nice, as it permits us to analyze some special 
cases. 

3.1.1 NBUE Distributions 

The case of J = 0 is of special interest, as it concerns simulations with no lookahead ability. Furthermore, 
consider simulations where T is non-lattice and New Better Than Used in Expectation ) (NBUE) [24] 4 . Many 
common distributions are NBUE, including normals truncated to be positive, gammas, Weibulls, and sums 
of nonnegative constants with exponentials. When T is NBUE, then T t is dominated stochastically 5 by the 
exponential with mean fi. Thus, if we replace each T e random variable in the definition of A(AbO) with an 
exponential having mean fi, the resulting mean (i/I< is at least as large as A (A, J). 

Proposition 3 Suppose that T xs non-lattice and NBUE. Then A(A,0) < fi/K . The optimal processor 
utilization in a no-lookahead simulation where T is NBUE is no greater than 1/A. 


This result shows the strong influence that A has on performance when T is non-lattice — it limits processor 
utilization to 1/K. If K remains proportional to A as A increases we have the following result. 

4 A non-negative random variable X is NBUE if for all t > 0, E[X\X > t] < E[X]. In other words, the expected residual life 
of X is never greater than the expected value of X . 

5 A is said to dominate Y stochastically if Pr{A' > t} > Pr{F > t } for all t. 
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Proposition 4 Let T be non-lattice and NBUE, and suppose K > f3N for all N in a no-lookahead simu- 
lation. Then under any protocol and for any N , the average number of total cycles processed by the system, 
per tick cannot exceed 1 /f3. 


K equals N in the cache simulation we have already described, because one LP's global reference is sent to 
all other LPs. The conclusions of Proposition 4 apply whenever a synchronization protocol treats the model 
as having no lookahead. 


3.1.2 Geometric Distribution 


Propositions 3 and 4 depend on T being non-lattice. When T is geometrically distributed with mean 
p — 1/p, then the residuals defining A(/i, 0) are also geometric with mean 1/p. It is straightforward to 
compute A(A r ,0) as the expected minimum of K independent geometries. This minimum has the same 
distribution as one geometric with mean 1/(1 — (1 — p) K ). Observe that this mean is always at least one, it 
cannot diminish arbitrarily as K is increased. When either p is small or K is large the mean is very close to 
one, leading us to the next proposition. 


Proposition 5 Suppose that T is geometric with mean p = 1/p. Then 

1 


A(A',0) < 


i-(i - p ) k ’ 

and processor utilization is no greater than p/(l — (1 — p) K )- Thus, as (1 — p) K 
is no greater than p. 


0, processor utilization 


3.1.3 Constant plus Exponential Distributions 


Larger upper bounds on utilization are possible given better lookahead ability. Suppose that a simulation 
has one-cycle full-lookahead ability, and consider the family of distributions where a constant 6 is added to 
an exponential with mean p x . Within this family we can decrease the variability by increasing 8 , but still 
retain the tractability of the exponential. This family of random variables is NBUE. In the Appendix we 
show that under these assumptions 


A(AM) <<5 + 


t r(6p x +pl) 2(6 + /i r )exp 


2 K 


+ 


( 3 ) 


The interesting thing to note about this bound is that it decreases in 1 /yfR rather than in 1/A’, as in 
the no-lookhead case. This suggests that really significant performance gains may be possible when K is 
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moderately large, by exploiting one-cycle full-lookahead. However, the fact that we have increased an upper 
bound on performance does not necessarily imply that performance itself must increase. In the following 
section we address this issue by deriving a lower bound on optimal performance under the assumption of 

full-lookahead for J > 1 cycles. 

3.2 Lower Bounds 

We now derive a lower bound on optimal performance for simulation models with full-lookaheads of J > 1 
cycles. Our approach is to view synchronization as a scheduling problem, and derive the performance one 
achieves using a particular scheduling strategy. Being sub-optimal, this performance provides a lower bound 
on optimal performance. The strategy we study forms the basis for a conservative synchronization protocol. 

Consider a simulation model with ./-cycle full-lookahead. Recall that we exploit J-cycle full-lookahead 
by requiring an LP who completes cycle m to predict and send the message associated with the end of cycle 
m + J. Suppose LP \ last executed cycle m, and knows (through some as yet unspecified means) that it will 
not receive any further messages with a time-stamp t or smaller, t falling within its (m + k)th cycle. LP, 
may safely compute cycles m + 1 through m + k- 1, and in doing so predict the messages associated with 
the completion of its cycles m + 1 + J through m + * + J - 1. The idea behind our scheduling strategy is 
to define a window by defining this t. All LPs may then advance as described above, whereupon we define 
a new t and hence a new window. Our strategy defines and processes each window with the following steps. 

1. For each LP determine the time of the next message it will send. If LP, last evaluated cycle m, the 
time of the next message it will send is C,(m +J+ 1). Compute the minimum such c min among all 
LPs. c min is called the ceiling. 

2. Each LP computes all its cycles with termination times strictly less than c min . For each cycle n that 
is so processed, the LP predicts and sends the message associated with the completion of cycle n + J. 

3. Each LP accepts the messages sent in the previous step. 

This process is repeated until the simulation termination condition is reached. 

The performance of this mechanism is derived as follows. Let c min (j) be the ceiling computed during the 
jth window. The asymptotic rate at which simulation time advances is identical to the asymptotic rate at 
which c min (j) advances. Consider the jth window: every LP { computes all as-yet-unprocessed cycles with 
completion times less than c min (i). Let m, be the last cycle so processed by LP,. Note that by the end of 
the window, LPi will have sent messages associated with cycles up to m, + J. The time of the next message 
LP, will send can be expressed as the sum of c min (j), the residual of the cycle containing c min (j), and the 
cycle time increments of the following J cycles. Therefore, the difference between the time of LPfs next. 
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message and c min (j) is composed of a cycle residual plus a 7-fold convolution of cycle time increments. We 
have already seen in §3.1 that as j grows large this difference may be approximated as a random variable 

mg the distribution of T t + F(J). Then c mtn (j + 1) can be expressed as c m j n (/) plus the minimum of N 
such random variables. This shows that 

E[c min (j + 1) - c min (j)] — tf(Ar, J) as j — OO. 

No LP can process more than 7 cycles in a window, because it can never advance beyond the time of 
the next message it will send ( 7 cycles distant), computed in step 1 of the window processing. *(N,J)/J 
consequently bounds the limiting rate at which simulation time advances from below. 

Proposition 6 If the suppositions of Proposition 1 are met, then the limiting rate of simulation time advance 
using lookahead scheduling on a J -cycle full-lookahead simulation model is at least ^{N,J)/J. The limiting 
processor utilization is at least y(N, J)/(Jp). 


Now consider the special case of 7 = 1, and the 6 + exp{p x } distribution. In the Appendix we show that 

S + W 1 )- (4) 

implying that p , the average processor utilization achieved, is at least 

> £ + 

6 + Hr 

“ ' r + 1 Where r = 6/p x . 

This shows that the extreme conservatism of having every LP block on the c min -tirne messages can be highly 

tempered. If r = 6/p x is at all significant the utilizations are quite good. For example, if r = 0.25 we still 

get at least 20% utilization. Increase r to 1 and we are assured of 50% utilization, r = 10 delivers 91% 
utilization. 

The case where T is geometric is also of interest. A (TV, 7) is composed of the minimum of N random 

variables, each the sum of 7 geometric random variables. Each geometric is at least as large as one, implying 

that A(N,J) > 7. If the geometries have mean p = 1/p, Proposition 6 shows that average processor 
utilization is at least 100 * p%. 

4 Analysis of Optimistic Protocol 

We next turn to a similar analysis of optimistic protocols. These protocols are complex, especially with 
regard to the effects of cascading rollbacks. It appears to be a formidable task to put a lower bound on 
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the rate at which an optimistic protocol advances simulation time. However, it is easy to extend the ideas 
of the previous section to put an upper bound on this rate. Indeed, our ideas of focusing the analysis on 
the costs along the critical path are mirrored in Lipton and Mizell’s proof that conservative methods cannot 
arbitrarily outperform Time Warp. 

Successful conservative schemes exploit simulation model characteristics such as non-preemptive queue- 
ing, and precalculation of event duration times. One of the great hopes for optimistic protocols is that 
they can be implemented without using explicit knowledge about the simulation model. Consequently, a 
“model-independent” implementation cannot assume any lookahead. The results of the previous section 
show that this assumption immediately limits the processor utilizations that are possible. Naturally, the 
cost of state-saving and rollback limits utilizations even more. To be sure, simulations using Time Warp may 
exploit model information; indeed, users who have and use Time Warp implementations have suggested they 
should do so [2]. Note however that these types of optimizations do not alleviate the burden of state-saving. 
The arguments to follow assume that Time Warp treats the simulation model as though it has no lookahead. 

Under Time Warp an LP rolls back if it receives a message with a time-stamp less than its clock. We 
assume that an LP’s state is always saved prior to the execution of a cycle, and model that cost with ticks 
of length Cs > 1, as compared to the earlier ticks of length I. We suppose that a rollback requires Cr time, 
measured in units where processing an event (without state-saving) takes unit time. The model bounds 
the rate of simulation time increase on the critical path. The only assumption used by the analysis is that 
a “late” message causes a rollback at the receiving processor, any effects due to rollback propagation are 
ignored. A rolled-back processor will re-execute cycles it was rolled past. For this reason our analysis is 
independent of whether “aggressive” or “lazy” message cancellation[23] is used. By assuming that cycles 
passed by the rollback are reevaluated, the analysis assumes that “jump forward” mechanisms [6] are not 
used. Under an optimistic protocol a given cycle of an LP may be processed a number of times before it is 
“cast”. To avoid complications we assume that the length in simulation time of a cycle is the same, every 
time the cycle is processed. 

Suppose that a rollback is initiated at LPi at tick k. Barring any further interruptions due to cascading 
anti-messages, the rollback completes at tick k + Cr. At tick k + Cr + 1 the LP rejoins the simulation and 
communicates its K messages. The idea behind the analysis of optimal performance in §3 was to bound the 
advance in simulation time between two ticks by looking at the LPs which receive the message with least 
time-stamp at a tick. Exactly the same idea applies here, except that the increase in simulation time must 
be measured over more than one tick. 

Consider the set of LPs who receive the message with time-stamp * mm (0- Let LP m in be the LP in this 
set with the minimum cycle residual (measured from t m in(0)s an< ^ reca b that N(i) is the tick at which LP m \ n 
sends its next message. We will use these definitions to partition the running time into phases that measure 
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the time between an LP \ s receipt of a t m j n -time message, and the next tick at which the LP completes 
a cycle and sends a message. The idea of a phase is to measure the rollback delay caused by receipt of 
the <min-time message. Phase 1 encompasses ticks 1 through N( 1) — 1. Phase 2 encompasses ticks N(l) 
through N(N(l)) — 1, phase 3 encompasses ticks A^A^l)) through N(N(N( 1))) — 1, and so on. When T 
is non-lattice, any LP receiving the t m m(?‘)-time message at the end of tick i must roll back, or already be 
rolling back. If it is already rolling back and if the transmission of the tf m j n -time message to that LP is 
independent of the fact it is already rolling back, then on average the rollback is half-way completed. In 
this case the mean number of ticks in a phase must be at least 1 -f Cr/ 2. This argument requires T to be 
non-lattice, for when T is discrete it is possible for an LP receiving the least time-stamp message to have 
the same clock value as the time-stamp, possibly making a rollback unnecessary. 

Let P(i) be the number of phases that have completed by tick i, c, be the tick that completes the phase 
containing i , and let tj be the tick that completes the jth phase. Then 


GVT(i)/i < GVT(ci)/i < N( Ci )/i 

Mo+i \ 

< I E m k,jU)\ a 

/ £fii )+1 fP(i) + 1\ 

V P(i) +1 ) V *■ / 

- { P(l)+1 ) 

We have already identified conditions under which the leftmost quotient converges to A(/i,0). Under similar 
conditions on the length of phases the rightmost quotient will converge to the reciprocal of the mean number 
of ticks per phase. Recall that each tick is a factor of Cs slower due to the cost of state-saving. This 
argument proves our next proposition. 


Proposition 7 Suppose the sequences {Mx,j(i)} and {ti+i— <,} converge to respective wide-sense stationary 
processes with finite correlation time. Let C Rh = C R / 2. If Time Warp treats a simulation model as though 
it has no lookahead then 


Jim GVT(i)/i < 


*/ -F 25 non-lattice 
if T is discrete 


Consequently , the processor utilization is no greater than ^CsIcr^i) w ^ €n ? ls non-lattice, and is no greater 
than when T is discrete. 
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An easy upper bound can be put on Time Warp even when it exploits lookahead, because every LP 
incessantly saves state. Each tick some LP executes a cycle, and first saves state, so that every processing 
tick is delayed by a state-save. Therefore, Time Warp’s performance cannot be any better than a factor of 
Cs worse than optimal. 

Proposition 8 The optimal rale of simulation time advance under Time Warp on a simulation model having 
J. cycle full-lookahead is no greater than A; average processor utilization is no greater than 


5 A Conservative Protocol 

The promise of (sometimes) good performance achieved by the scheduling strategy described in §3.2 sug- 
gests its use as the basis for a synchronization protocol. Unlike many conservative protocols, this one is 
synchronous, in that the computation of the ceiling value implicitly contains a global synchronization among 
processors. This synchronization is all that is needed to implement the policy. The lower bound on perfor- 
mance we derived in §3.2 must change to accommodate the cost of computing the ceiling. Define C G > 1 
so that a processor is engaged in synchronization overhead 100(1 - l/C G )% of the time. Equivalently, one 
can view the ticks as being C G percent of the length of our earlier ticks, due to the overhead of synchroniza- 
tion. Depending on the granularity of the event computation the delay cost of synchronization can be quite 
small, as most richly connected architectures such as a hypercube can compute a global minimum in log A 
steps. Some architectures such as the second generation Connection Machine already have hardware support 
for common global reductions like the minimum. Including this synchronization cost, the lower bound on 
processor utilizations becomes ^(N , J)/(C G J p). 

Consider again a simulation with 1-cycle full-lookahead with * + exp{p*} time increments. Using approx- 
imation (3) and inequalities (3) and (4) we can put a lower bound on the ratio of the conservative protocol’s 
utilization to optimal utilization. Table 1 plots this bound as a function of 6 and K, for fixed n* - E C G = 2 
and N = 65536. 

Relatively good performance is possible when 6 is non-trivial relative to /i* and/or when K is large, even 
though synchronization overheads are 50%. However, if the upper bound on optimal performance is at all 
tight there is clearly room for significant improvement. It is here that the extreme conservativeness of having 
every LP wait for the least-time future message hurts. It may be that more complex protocols such as the 
bounded-lag protocol [13] could significantly boost performance in this region. 

We can determine situations where this conservative protocol achieves better performance than Time 
Warp. Assume the validity of approximation (3), and assume that T has the 6 + exp{/z*} distribution. 
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I<\6 

0.00 

0.10 

0.5 

1.00 

5.0 

10.0 

2 

0.001 

0.021 

0.076 

0.116 

0.208 

0.237 

4 

0.002 

0.034 

0.119 

0.176 

0.295 

0.330 

8 

0.003 

0.053 

0.173 

0.243 

0.367 

0.399 

16 

0.004 

0.079 

0.231 

0.303 

0.409 

0.435 

32 

0.007 

0.114 

0.284 

0.348 

0.434 

0.453 


(Conservative utilization)/(Optimal utilization) 


Table 1: Approximated lower bound on fraction of optimal performance achieved by the conservative protocol 
when fj, x = 1, C G - 2 and N — 65536. 


This distribution is N3UE, whence by Propositions 3 and 7, l/(Cs(Cfth 4- 1)A') is an upper bound on Time 
Warp s utilization. From this, Proposition 6, and equation (4) we determine that the conservative protocol 
achieves better performance than Time Warp whenever 

Cs{Crh 4- 1) ~ S + v x 

Note that this inequality assumes that Time Warp is not exploiting lookahead. 

Estimates for individual process state sizes in the near term at the Jet Propulsion Lab are from 4K up 
to 1M bytes [1]. For 4K state sizes, it is estimated that 90% of a processor’s time could be devoted to 
saving state. Using Time Warp on these production problems without the benefit of hardware accelerators, 
Cs = 10 is apparently a reasonable value. 

Physical processes modeled by LPs very rarely have zero duration times. Many modeled processes exhibit 
a fixed startup cost, e.g. chocking a bit into a drill in a manufacturing simulation. Therefore, non-zero values 
of S seem reasonable in practice. Relatively large values K are also common, especially in domain oriented 
simulations where domain sectors are LPs that communicate in a nearest neighbor pattern. 

Table 2 plots the ratio of the lower bound on the conservative method’s utilization to the upper bound 
on Time Warp utilization, as a function of 6 and K for fixed fi x = 1, N = 65536, C G = 2, and C Rh = 0. 
One set of data assumes that Cs — 10. Another assumes that Cs ~ 2, making state-saving comparable to 
the cost of a global synchronization. 

The performance difference is not so great, when T is geometric. Using Propositions 5 and 7 we bound 
Time Warp’s utilization from above by p/{Cs{ 1 - (1 - p) K ). A simple lower bound on the utilization of 
the conservative protocol is p/Cq . Table 3 plots the ratio of these bounds for the same set of parameter 
values as did Table 2. The values of p are chosen to yield the same mean values of T as those in Table 2. 
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K\6 

0.00 

0.10 

0.50 

1.00 

5.0 

10.0 

2 

0.049 

0.954 

3.366 

5.024 

8.341 

9.095 

4 

0.098 

1.907 

6.732 

10.049 

16.683 

18.191 

8 

0.196 

3.814 

13.464 

20.098 

33.366 

36.381 

16 

0.392 

7.629 

26.928 

40.196 

66.732 

72.763 

32 

0.783 

15.258 

53.856 

80.392 

133.464 

145.526 


(Conservative utilization) /(Time Warp utilization) 
6 + exp{^x) distribution, high state-saving costs 


K\6 

0.00 

0.10 

0.50 

1.00 

5.0 

10.0 

2 

0.010 

0.191 

0.673 

1.005 

1.668 

1.819 

4 

0.020 

0.381 

1.346 

2.010 

3.337 

3.638 

8 

0.039 

0.763 

2.693 

4.020 

6.673 

7.276 

16 

0.078 

1.526 

5.386 

8.039 

13.346 

14.553 

32 

0.157 

3.052 

10.771 

16.078 

26.693 

29.105 


(Conservative utilization) /(Time Warp utilization) 
6 +exp{p r } distribution, low state-saving costs 


Table 2: Comparison of conservative protocol and Time Warp when T has the 6 + exp{p r } distribution, 
i = 65536, Crh = 0. High state-saving costs modeled with Cs = 10, low state-saving costs with 

C$ = 2. 


Despite the better showing by Time Warp, in most of the cases shown the conservative method compares 
favorably with Time Warp. The insensitivity of the conservative method to fanout is a direct consequence 
of its implicit assumption assumption that the next message to an LP can come from anywhere. This is 
equivalent to assuming a fanout of N . 

Simulation studies suggest that our upper bound on Time Warp’s performance is somewhat larger than 
the observed performance. Figure 2 illustrates the point by plotting the measured (simulated) performance of 
Time Warp and our conservative method on the analytic model. Comparable overheads are used ( C s = C G = 
2), the conservative method exploits a 1-cycle full-lookahead model while Time Warp does not. Aggressive 
cancellation is used in the Time Warp simulation. 
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K\p 

i/i 

i/i.i 

1/1.5 

1/2 

1/6 

l/n 

2 

5.000 

4.959 

4.444 

3.750 

1.528 

0.868 

4 

5.000 

5.000 

4.938 

4.688 

2.589 

1.585 

8 

5.000 

5.000 

4.999 

4.980 

3.837 

2.667 

16 

5.000 

5.000 

5.000 

5.000 

4.730 

3.912 

32 

5.000 

5.000 

5.000 

5.000 

4.985 

4.763 


(Conservative utilization)/(Time Warp utilization) 
geometric distribution, high state-saving costs 


I<\p 

i/i 

1/1.1 

1/1.5 

1/2 

1/6 

i/n 

2 

1.000 

0.992 

0.889 

0.750 

0.306 

0.174 

4 

1.000 

1.000 

0.988 

0.938 

0.518 

0.317 

8 

1.000 

1.000 

1.000 

0.996 

0.767 

0.533 

16 

1.000 

1.000 

1.000 

1.000 

0.946 

0.782 

32 

1.000 

1.000 

1.000 

1.000 

0.997 

0.953 


(Conservative utilization)/(Time Warp utilization) 
geometric distribution, low state-saving costs 


Table 3: Comparison of conservative protocol and Time Warp when T has geometric distribution. pi x = 1 
N = 65536, C Rh = 0. High state-saving costs modeled with C s = 10, low state-saving costs with C s = 2. 


6 General Application 

The conservative protocol described earlier has broader application than just to our simple analytic model. 
Its principles form the basis of a parallel simulation testbed we have implemented on an Intel iPSC/2 [19]. 
The key idea to making efficient use of such a coarse-grained machine is aggregating large numbers of LPs for 
evaluation on each processor. One advantage to aggregation is that a model which suffers very low processor 
utilizations when each LP has its own processor can achieve good processor utilizations on a coarse grained 
machine. For example, consider a model with 65536 LPs, which gets 1% utilization on an architecture with 
65536 processors. Evaluate that model on a machine with 64 processors, and on average each processor 
will have more than 10 events to process each synchronization window. Indeed, the results developed in the 
framework of a more complex stochastic model show that given 1-cycle full-lookahead, performance of our 
method approaches optimality as the size of the problem is increased relative to the architecture [20], These 
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Figure 2: Empirical comparison of Time Warp with aggressive cancellation and conservative protocol. Con- 
servative protocol exploits 1-cycle full-lookahead, Time Warp does not. Overheads are equivalent: C G = 2, 
C s = 2. N = 65536, communication patterns are random. 6 = 0.25, n x = 1. 

conclusions are supported by experiments on the testbed where we have achieved processor utilizations in 
the range of 60%-85% using 32 processors on large queueing network, logic network, and cellular automaton 
simulations. The performance degradation there is not due to blocked processors, it is due to communication 
and synchronization overheads. 

The results reported here are easily adapted to simulation models having only time-lookahead. The 
conservative protocol is modified so that promises of future messages are sent as soon as possible, and then 
the messages themselves are sent upon completion of the appropriate cycle. The windows are defined as 
before (the ceiling is the least next appointment time to be sent), but instead of processing all events in 
one pass, the protocol iterates over the window. Each iteration, computations that are assured of no future 
messages (as established by the lookahead message times) are performed. These create messages that will 
“free” other computations that depend on them. The analysis goes through as before, except that the 
increase in simulation time must be amortized over the average number of iterations in a window. We have 
not yet firmly established this value, but heuristic arguments suggest that it is 0(log A ). 
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7 Conclusions 

This paper proposes and analyzes an intuitive model of massively parallel discrete-event simulations. We 
derive non-tnvial upper and lower bounds on optimal performance for certain classes of simulations, derive 
an upper bound on Time Warp’s performance, and derive a lower bound on the performance of a newly 
proposed conservative method. These results permit a derivation of sufficient conditions for the conservative 
method to outperform Time Warp. 

Our analysis quantifies the dependence of performance on the time-increment distribution, showing that 
distributions with significant constant components lead to good performance. We also determine the sensitiv- 
ity of performance to lookahead, and to message fan-out. Unfortunately, our results rest on approximations 
which are justified only heuristically, although there is excellent agreement between analytic and empirical 
results. Future research may be directed towards firming up the foundations of our approach. 

Our results are significant in two ways. To our knowledge this is the first analysis able to analytically 
compare the performance of a synchronization protocol on a stochastic model with a non-trivial bound on 
the optimal performance one can achieve. It is also significant that we are able to classify simulation models 
under which a conservative method has provably good performance. To be sure, there are a large number of 
simulation models where our protocol will fail miserably, and there are a large number of models which lack 
the lookahead demanded by our method. Nevertheless, a better understanding of the complex behavior of 
parallel simulations demands analysis, and this paper is an early efTort at providing that analysis. 
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Appendix 

In this appendix we derive a number of results too detailed to include in the body of the paper. 


Limiting Distribution of Residual 


Let X(t) and y(t) be independent renewal processes having non-lattice inter-renewal distribution F and G 
respectively. Let R x (t) denote the residual life at t of the process X(t ) — the remaining time until the next 
renewal. It is known [24] that as t grows large Rx(t) converges in distribution to F y s associated eqvilbrium 
distribution F e : 

[CO 

Pr {F e > s] = I Pr{F > u}/ ix du 

where /i is F’s mean. Now let Sj be the time of the jth renewal in y(f). We will sketch an argument showing 
why the limiting distribution of R x (Sj) as j — ► oo is also F e . 

To show convergence we must demonstrate that for every x > 0 and e > 0 there exists a j € such that for 
all j > j £ , 

\Pr{R x (S J )>x}-Pr{F e >x}\<c. 


Choose any x and e. Let gj(t) be the density function of Sj (a j-fold convolution of G). By the independence 
of X(t) and Y (t) we may write 


and 


/•OO 

Pr{i2 x (5 j )>x}= / Pr{i2x(<) > £}&(<) dt, 

Jo 

f°° 

| PT{R x (Sj) > x} - Pr {F e > x}| = / I Pr {R x (t) > x} - Pr {F e > x}\gj(t) 

Jo 


dt . 


Because Rx(t) converges in distribution to F e as t — ► oo, we may choose t € so large that for all t > t e , the 
absolute difference inside the integral above is no greater than e/2. We may also choose some j € so large 
that Pr { Sj < f £ } < e/2 for all j > j € . In this case for all j > j ( 


[CO 

/ I Pr {R X (t) 

Jo 


> x} - Pr{T e > ar}|^(<) dt 


< 

< 


( Qj(t) dt + 
Jo 

e/2 + e/2. 



dt 


This demonstrates that the distribution of Rx(Sj) converges to F e . 


limj-HX) ELi Mk,j(i)/i -+ *(K, J ) 

Here we describe reasonable conditions under which the limiting average of the sequence { A/x j(?)} converges 
to ^(K , J). View the sequence M# j{ 1), A/^ j(2), . . . , as a discrete-time stochastic process {M^j(i)}. For i 
large it is reasonable to assume that this process is stationary in the wide-sense[8], meaning that there exists 
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values ^k,j an ^ C 2 K 3 such that E[MKj{i)\ = 9 k } j < oo and E[MK,j(i ) 2 ] = C\ 3 < oo for all * sufficiently 
large, and that Cov[Afjc t j(i), Mk,jU)] is a function only of |i — j\. Furthermore, we expect that Mx y j(i) 
and Mk,jU) should become independent as | j - i\ grows. If they become independent quickly enough, we 
will have 

oo 

£ \E[M Ki j(i)MKAj)]-*K,A<°°- ( 5 ) 

j=i + 1 

If this inequality holds true (for a general wide-sense stationary process) the process is said to have a finite 
correlation time . 

We have not developed formal arguments that {A//c ( j(i)} becomes wide-sense stationary with a finite 
correlation time. However, we can give heuristic reasons why it is reasonable to assume so. Mxj(i) is the 
minimum of K random variables, each comprised of the sum of a residual plus the sum of J cycle times. 
Among all these let TWx he the maximum index of a cycle time random variable (or residual) appearing 
in Mx,j(i). Now suppose the time-increment distribution is exponential. Any composed entirely 

of random variables from cycles with indices greater than n max is independent of owing to the 

memoryless property of the exponential. Let D be the random number of ticks that pass after i before every 
LP has evaluated cycle n max . Then we have 

oo i+D 

i=*+i i=«+ 1 

Since M K J {i) and have the same distribution, we must have E[M K j{i)Mic t j{j)] < C# j. Thus 


i+D 

J2 \E[M Kt j(i)M Kt jU)] - W^) 2 | < E[D] (- C% tJ - * 2 KiJ ) < oo 

j =»+i 

provided that E[D ] is finite. Assuming that a serial simulation will always advance a given finite amount of 
simulation time in a finite expected number of cycle executions, E[D] will be finite. This is true because a 
serial simulation will always advance the LP with least next message time each tick, and in a finite expected 
number of steps will advance each LP at least once. E[D] is no larger than this expectation, and is hence 
finite. This argument rests on the fact that Mk } j(1) an d become independent once j is large 

enough. The independence is an artifact of the exponentiality. Intuitively, if the tail of T cannot become too 
large (e.g. if T is NBUE or if it has an increasing hazard rate function), it is reasonable to expect rapidly 
diminishing correlation and hence a finite correlation time. Such technical details appear to be difficult to 
establish. 

The result we want follows if {Mk,j( 0} wide-sense stationary with a finite correlation time. Let 
A/ X j(?) be the average M K j value taken over the first i ticks: 




Ek MkMJ) 

i 
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From [8](p. 484) we find that 


lim E{(M K ,j{i)-V{K,J)) 2 } = a. 

t—i ►CO 

This condition is sufficient strong for us to take J) as the limiting value of 

Expected Minimum of Exponential Sums 

Next we derive upper and lower bounds on the expected minimum of n independent and identically random 
variables, each constructed by adding two independent exponentials which need not have the same mean 
(this is a slight generalization of an Erlang-2 distribution). 

Our approach is to analyze the hazard rate function of a single exponential-sum. We will construct one 
random variable with a larger hazard rate, and one with a smaller one. The former is stochastically smaller 
than the exponential-sum, hence the minimum of n such is stochastically smaller than the minimum of n 
exponential-sums. Similarly, the minimum of n independent random variables that stochastically dominate 
an exponential-sum will stochastically dominate the minimum of n exponential-sums. 

Let Ai, A 2 with Ai > A 2 be the exponential parameters for exponentials X\ and X 2 . The hazard rate 
function A,(tf) of X\ 4- X 2 is found by considering a two-stage process where the the first stage requires X x 
time and the second stage requires X 2 . Intuitively A^(t) is the instantaneous probability density associated 
with the process finishing at t, given that it has not yet finished. Condition on whether Xi < t: if so, then 
A,(<) is A 2 because the process is in the second stage, if not it is zero because the process has not yet finished 
the first stage. Thus we have 


A*(f) — (1 — PrjXi > t\X\ -j- X 2 > t)X 2 . (6) 

An equivalent (but much nastier) expression for A is derivable from first principles, taking the quotient of 
the density function of X\ 4- X 2 to the probability that X\ +X 2 > t. Our expression is more convenient, in 
that it suggests simple ways in which A,(f) can be bounded from above and below. 

An upper bound on A s (t) is constructed by observing that the conditional probability in equation (6) is 
at least as large as Pr{A"i > f} = exp{-Aif}. Thus 


MO < (1 - exp{-Ai*})A 2 . 


This latter function is concave in t and is hence dominated everywhere by the line tangent to it at t = 0: 
MO = A random variable with hazard rate function hi(t) is therefore stochastically dominated by 

Xi + X 2 . 

An lower bound on X 3 (t) is constructed by exploiting the fact that 

Pr^Yi > 0 


Pr{Ai >t\X 1 +X 2 >t} = 


Fr{X x + X 2 > t] 
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< — P r {^i > £} — where X 3 has the distribution of X\ 

~ Pr {Xi + X 3 >t} 

1 


1 + \\t 


Consequently, 

This function is increasing and concave in t. It equals A 2/2 when t = 1/A*. Consequently, this function 
dominates the piecewise linear function /i u (<) which rises linearly with slope AiA 2 /2 until t — 1 / A 1 , and then 
is the constant A 2 / 2 . 

Let Zi,...,Z n be n independent random variables with hazard rate h u (t). The hazard rate of the 
minimum of these is simply n • h u (t). To compute the expected minimum we use a well-known relationship 
between a random variable's hazard rate function and its cumulative distribution function. We have 

P 00 

E[min{Zi,...,Z „ }] = / Pr{min{Zi, . . . , Z n ] > i} dt 

Jo 

p 00 pt 

— I exp{— I n • h u (s)ds} dt 
Jo Jo 

p l/Ai roo 

= / exp{— nAj A 2 < 2 /4} dt + / exp{— (tfnA 2 /2 — nA2/(4Ai))} dt 

Jo J l/Ai 

f l/Xl r x x j2 / ii ^ , 2exp{— nA 2 /(4Ai)} 

= / exp{-nAiA 2 < /4} dt 4 ■ 

Jo v nA 2) 

To evaluate the remaining integral we make the change of variables s — <>/nAiA 2 /2, and discover that 

p \ j \ \ j 2 f V n i 2 

J exp{— nAi A2< 2 /4} dt = \j J exp{-s 2 /2} ds 


-/ 


where ©() is the cumulative distribution function for a standard normal. This gives the upper bound 
£’[minimum of n iid exponential-sums] < \J n \ ^ 




< 


7 r 2exp{— nA 2 /(4Ai)} 


( 7 ) 


nAiA 2 ' (nA 2 ) 

A lower bound is found similarly. The hazard rate for the minimum of n independent random variables 
having hazard rate function hj(t) is n * Integrating as we did to derive the upper bound, we determine 

a lower bound of 


2nXi A; 


< ^[minimum of n iid exponential-sums] 


(8) 
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Bounds on 6 + exp(/^) Distribution 

Next we consider some bounds on A(A', 1) derivable when the time-increment distribution is a constant 8 
plus an exponential with mean fi X) and when the simulation has 1-cycle full-lookahead. 

A (K, 1) is the expected minimum of K random variables, each comprised of a residual plus a time- 
increment value. The residual has the distribution of the equilibrium distribution of the 6 + exp(^ r ) distri- 
bution. We first consider this equilibrium distribution. Working directly from definitions [24], we determine 
that its hazard rate is 

for t < 8 
for t > 8 

Since h(t) > 1/(8 + fj. x ) for all t , the equilibrium distribution is stochastically dominated by an exponential 
with mean 8 + /i*. Let R t be a residual having this equilibrium distribution, R f be an exponential with 
mean 8 + fi x , and X, be exponential with mean jJt x . Then the sum R{ + 8 + X, is stochastically dominated 
by R- -1- 8 + , and 

E[ min {R{ + 8 + X,}] < 8 4- ^[minimum of K iid exponential-sums, parameters 1/(8 + fi x ) and 1 /^ r ] 

1 ^ iX X 

+ vl) + 2 (^ + ^) eX P{4(7+p x )} 

2n n 

Ri stochastically dominates an exponential with mean fi x . Consequently a lower bound on the minimum 
of interest is found by replacing Ri with such an exponential. Then 



MO 


+ — t 

1 


E[ min {Ri + X,} > 

> 


8 -f ^[minimum of K iid exponential-sums, both parameters l/fi x ] 



27 



NASA 

Maloti .^o^autcs anc 
Sr-Ce Aor”"TrS!r3!>O n 


Report Documentation Page 


1. Report No. 

NASA CR- 182010 
ICASE Report No. 90-21 


2. Government Accession No. 


4. Title and Subtitle 

PERFORMANCE BOUNDS ON PARALLEL SELF-INITIATING 
DISCRETE-EVENT 


7. Author(s) 

David M. Nicol 


9 Performing Organization Name and Address 

Institute for Computer Applications in Science 

and Engineering 

Mail Stop 132C, NASA Langley Research Center 
Hampton, VA 23665-5225 


12. Sponsoring Agency Name and Address 

National Aeronautics and Space Administration 
Langley Research Center 
Hampton, VA 23665-5225 


3. Recipient's Catalog No. 


5. Report Date 

March 1990 


6. Performing Organization Code 


8. Performing Organization Report No. 
90-21 


10. Work Unit No. 
505-90-21-01 


11. Contract or Grant No. 

NAS1-18605 


13. Type of Report and Period Covered 
Contractor Report 


14. Sponsoring Agency Code 


15. Supplementary Notes 

Langley Technical Monitor: 
Richard W. Barnwell 


Submitted to ACM Transactions 
on Modelling and Computer 
Simulations 


Final Report 


16. Abstract 

This paper considers the use of massively parallel architectures to execute 
discrete-event simulations of what we term "self-initiating" models. A logical 
process in a self-initiating model schedules its own state re-evaluation times, in- 
dependently of any other logical process, and sends its new state to other logical 
processes following the re-evaluation. Our interest is in the effects of that 
communication on synchronization. We consider the performance of various s y nc ^on- 
ization protocols by deriving upper and lower bounds on optimal performance, upper 
bounds on Time Warp’s performance, and lower bounds on the performance of a new 
conservative protocol. Our analysis of Time Warp includes the overhead costs of 
state-saving and rollback. The analysis points out sufficient conditions for the 
conservative protocol to outperform Time Warp. The analysis also quantifies the 
sensitivity of performance to message fan-out, lookahead ability, and the probabil- 
ity distributions underlying the simulation. 


17. Key Words (Suggested by Author(s)) 

discrete— event simulation, parallel 
algorithms, performance analysis 


18. Distribution Statement 

62 - Computer Systems 
66 - Systems Analysis 

Unclassified - Unlimited 


19. Security Classif. (of this report) 
Unclassified 


20. Security Classif. (of this page) 
Unclassified 


21 


No. of pages 

29 


22. Price 

A0 3 


NASA FORM 1626 OCT 86 


NASA-L»ngley, 1990 






